Ethics, legal aspects: why should we care?
Two main reasons (according to Matt Salganik)
1. Fear-Based
At the individual and at the collective level, we have an incentive to avoid trouble.
And there can be plenty of it.
2. Hope-Based
We are currently limiting ourselves, while innovative and revealing research could be developed.
⇒ Overall, we have no choice
So what should we do? This is where the problems start
Today: a quick introduction to the legal & ethical aspects you should be paying attention to.
Watch out:
- I am not a lawyer
- Ethical norms evolve (stay tuned)
2 dimensions (scraping / data management) × 2 aspects (legal / ethical)
Rule #1 is that scraping a website is often ILLEGAL (yes, it’s illegal).
Several issues with scraping:
- Infringement of database rights
- Copyright
Websites enforce these rules differently
- Some will let you collect the information you are seeking
- Some will try to limit your impact
+ Positively, by offering an API
+ Negatively, by placing some constraints on your actions
- Some websites will actively go after you
Remember: just because you can do it does not mean it is legal.
BUT
There are ‘research exceptions’ to these rules
In particular, a European directive (Directive (EU) 2019/790, Articles 3 and 4) specifies the scope of the exception for ‘text and data mining’.
Here is the text of the directive, the French transposition from 2021, and some plain text explanations.
Some (vague) details about this exception
In France (~Europe), it is possible to carry out text and data mining on any resource to which you have legal access. There is no limit on volume or duration, but conditions apply.
This includes:
- copyrighted material you have purchased (archive of a newspaper, database);
- data you collected via a subscription (newspaper databases paid for by a university library);
- content available on the internet;
on the condition that the acquisition was licit.
Conditions:
- Access to the data was legal.
- No theft, no cheating
Data protection is key: store the data in a secure location and take measures to prevent any leak or loss.
Ethics
Even if it is allowed, don’t forget to apply good practices:
- Is there an API? If yes, please use it
- Introduce yourself (give an email address)
- Be kind (add some pauses)
- Be sober, download only once
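The last three practices can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a complete scraper: the contact address, the cache directory and the two-second pause are hypothetical placeholders, and the helper names `download` and `polite_fetch` are made up for this example. It uses only the standard library.

```python
import hashlib
import time
import urllib.request
from pathlib import Path

# Hypothetical contact details: replace with your own address.
USER_AGENT = "research-scraper (contact: your.name@example.org)"


def download(url):
    """Fetch a URL, introducing ourselves through the User-Agent header."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


def polite_fetch(url, cache_dir, fetch=download, pause=2.0):
    """Return the page content, hitting the network at most once per URL."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # One cache file per URL, named after a hash of the URL.
    cached = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if cached.exists():
        return cached.read_bytes()  # be sober: already downloaded once
    content = fetch(url)
    cached.write_bytes(content)
    time.sleep(pause)  # be kind: pause after each real request
    return content
```

Calling `polite_fetch(url, "cache")` twice with the same URL triggers a single request; the second call reads the local copy. If the site offers an API, that remains the better route.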
The problem with the human and social sciences is that they deal with human subjects.
This makes all actions fall under the remit of the GDPR (General Data Protection Regulation).
GDPR is a vast, complex code.
To make things worse, its interpretation often depends on who reads it.
Rule #1: any data that includes either identifying or sensitive information is subject to GDPR provisions.
Personal data: any information that can lead (directly or combined with other information) to the identification of a living natural person (GDPR 4, Recital 27).
- A name, an e-mail address
- A username or handle – and everything (tweets, forum posts, etc.) connected to this username!
- An IP address (if not leading to a VPN server)
- Images (of a face) and voice recordings
- All kinds of contextual information that can lead to identifying a person: “I am a woman, a sociologist, working at ENSAE. I am also originally from Italy, and…”
- Watch out: data can be merged and become identifying (an email leading to a cell phone number)
Personal data should be processed for “specified, explicit and legitimate purposes” (GDPR 5)
Personal data
From here on, we enter complex territory, since you can process this data only under one of the following conditions:
- Consent
- Fulfilling a contract or legal obligation
- Protecting the vital interests of the data subject
- Carrying out a task in the public interest ← Go for this one
- The legitimate interests of the controller (e.g. a landlord will need to collect some personal data of tenants in order to be able to rent out apartments).
Sensitive information
The GDPR identifies special categories of personal data (GDPR 9) as follows:
- Racial or ethnic origin
- Political opinions and religious or philosophical beliefs
- Trade union membership
- Information about health, sex life and sexual orientation
- Genetic and biometric data that can lead to the identification of a living person (but not images in general).
Difference:
Processing personal data is allowed “if…”, processing special categories is prohibited “unless…”
→ go talk to your data protection officer!
This looks daunting.
Yet do not give up on doing research because of this.
5 pieces of (not legally sufficient) advice
1. Store your data adequately
- Are your files on a secure server?
- Is your disk encrypted?
- Do you protect the names of the people you interview?
- Where do you store copies of your material?
2. Ask yourself: are you affecting your subjects?
- Obtrusive vs. unobtrusive methods
- Is it worth it? Is it scientifically legitimate?
3. Do you anonymize properly?
- Change/Erase all names
But: Is this enough? Probably not
Reason #1: anonymity != confidentiality
Reason #2: Re-identification can happen with limited data (Sweeney, 2000)
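A quick way to see why removing names is not enough is to count how many records remain unique on a handful of ordinary attributes. The sketch below uses entirely made-up records and a hypothetical helper, `unique_share`; the choice of postcode, birth date and sex as quasi-identifiers mirrors the kind of combination Sweeney studied, but the numbers here illustrate nothing beyond the toy data.

```python
from collections import Counter

# Entirely made-up records: names are already removed, yet three
# quasi-identifiers remain (postcode, birth date, sex).
records = [
    {"postcode": "75011", "birth_date": "1985-03-12", "sex": "F"},
    {"postcode": "75011", "birth_date": "1985-03-12", "sex": "F"},
    {"postcode": "75011", "birth_date": "1990-07-01", "sex": "M"},
    {"postcode": "91120", "birth_date": "1978-11-23", "sex": "F"},
    {"postcode": "91120", "birth_date": "1978-11-23", "sex": "M"},
]

QUASI_IDENTIFIERS = ("postcode", "birth_date", "sex")


def unique_share(rows, keys=QUASI_IDENTIFIERS):
    """Share of rows whose combination of `keys` appears exactly once."""
    combos = Counter(tuple(row[k] for k in keys) for row in rows)
    unique = sum(1 for row in rows if combos[tuple(row[k] for k in keys)] == 1)
    return unique / len(rows)


print(unique_share(records))                      # 0.6
print(unique_share(records, keys=("postcode",)))  # 0.0
```

On this toy table nobody is unique on postcode alone, but three of the five people become unique once birth date and sex are added. Running the same count on your own dataset before sharing it gives a rough first indication of re-identification risk.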
4. ‘But the data is already public’ is not a good excuse
5. Should you release the data?
- A growing demand (for replication, for cumulativity)
- All good, but should not happen at the expense of your subjects.
⇒ Tread carefully, and borrow from ethnographers